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Abstract — This paper studies the problem of distributed com- 
putation over a network of wireless sensors. While this problem 
applies to many emerging applications, to keep our discussion 
concrete we will focus on sensor networks used for structural 
health monitoring. Within this context, the heaviest computation 
is to determine the singular value decomposition (SVD) to extract 
mode shapes (eigenvectors) of a structure. Compared to collecting 
raw vibration data and performing SVD at a central location, 
computing SVD within the network can result in significantly 
lower energy consumption and delay. Using recent results on 
decomposing SVD, a well-known centralized operation, into 
components, we seek to determine a near-optimal communication 
structure that enables the distribution of this computation and the 
reassembly of the final results, with the objective of minimizing 
energy consumption subject to a computational delay constraint. 
We show that this reduces to a generalized clustering problem; 
a cluster forms a unit on which a component of the overall 
computation is performed. We establish that this problem is NP- 
hard. By relaxing the delay constraint, we derive a lower bound to 
this problem. We then propose an integer linear program (ILP) to 
solve the constrained problem exactly as well as an approximate 
algorithm with a proven approximation ratio. We further present 
a distributed version of the approximate algorithm. We present 
both simulation and experimentation results to demonstrate the 
effectiveness of these algorithms. 

Index Terms — Networked Computing, Wireless Sensor Net- 
works, Structural Health Monitoring, Clustering, Degree- 
Constrained Data Collection Tree, Singular Value Decomposition. 



I. Introduction 

Over the past decade, tremendous progress has been made 
in understanding and using wireless sensor networks. Of 
particular relevance to this paper are extensive studies on in- 
network processing, e.g., finding efficient routing strategies 
when data compression and aggregation are involved. How- 
ever, many emerging applications, e.g., body area sensing, 
structural health monitoring, and various other cyber-physical 
systems, require far more sophisticated data processing in 
order to enable real-time diagnosis and control. 

This leads to the question of how to perform arbitrary 
(and likely complex) computational tasks using a distributed 
network of wireless sensors, each with limited resources both 
in energy and in processing capability. The answer seems 
to lie in two challenges. The first is the decomposition of 
complex computational tasks into smaller operations, each 
with its own input and output and collectively related through 
a certain dependency structure. The second is to distribute 
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these operations among individual sensor nodes so as to 
incur minimal energy consumption and delay. This networked 
computing concept is a natural progression from the networked 
sensing paradigm. 

Previous results on establishing the communication structure 
for in-network computation either consider only simple func- 
tions like max/min/average/median/boolean symmetric func- 
tions [1|-|5| that do not fully represent the complex computa- 
tional requirements demanded by many practical engineering 
applications, or study asymptotic bounds which do not yield 
algorithms to determine the optimal communication structure 
for a given arbitrary network |6)-||9]. 

In this paper we address the second challenge of finding the 
optimal communication structure for a given decomposition of 
a computational task. While there are certain underlying com- 
monalities, this is in general an application specific exercise 
dependent on the actual computation. To keep our discussion 
concrete, we will study this problem within the context of 
structural health monitoring (SHM). SHM is an area of rapidly 
growing interest due to the increasing need to provide low-cost 
and timely monitoring and inspection of deteriorating national 
infrastructure; it is also an appealing application of wireless 
sensor technologies. 

The most common approach in SHM to detect damage is 
to collect vibration data using a set of wireless sensors in 
response to white/free input to the structure, and then use the 
procedure of singular value decomposition (SVD) to determine 
the set of modes ifTOl - lfTZl . A mode is a combination of a 
frequency and a shape (in the form of a vector), which is the 
expected curvature (or displacement) of a surface vibrating at 
a mode frequency. In this study, we will use SVD as a primary 
example to illustrate an approach to determine how to perform 
such a complex computational task over a network of resource- 
constrained sensors. Compared to collecting raw vibration 
data (or the FFT of raw data) and performing the SVD 
computation at a central location, directly computing SVD 
within the network can result in significant reduction in both 
energy consumption and computational delay. Conceptually, 
this reduction occurs because the output of SVD, a set of 
eigenvectors, is much smaller in size than its input, FFTs from 
individual sensor data streams, and evaluating multiple smaller 
SVDs in parallel is much faster than evaluating the SVD on 
the input from all sensors. 

In a recent result, Zimmerman et al. Ifl3l proposed a method 
to decompose the SVD computation, a classical centralized 
procedure. Here, we examine how to obtain an optimal 
communication structure corresponding to this decomposition, 
where optimality is defined with respect to minimizing energy 



consumption subject to a computational delay constraint. We 
show that this reduces to a generalized clustering problem, and 
here lies the generality of the results presented in this paper. In 
essence, a centralized operation is decomposed into a number 
of computational elements (or operators) each operating on a 
set of inputs. The optimization problem is then to figure out 
what set of inputs to group together (forming a cluster), and 
what relationship needs to be maintained (in the form of data 
sharing or message passing) between clusters according to the 
specification of the decomposition. For this reason, while SVD 
is the example throughout our discussion, the methodology 
used here is more generally applicable. Note also that SVD 
itself is an essential ingredient in a broader class of signal 
processing algorithms, including classification, identification 
and detection fOl-fllfl. 

Our main results are summarized as follows. 

1) We formally define the above networked computing 
problem and establish that it is NP-hard (in Section Hill. 

2) We derive a lower bound by relaxing the delay constraint 
and show that the optimal solution to the unconstrained 
problem has a simple structure that sheds light on the 
original problem (in Section ITTTb - 

3) We present an integer linear program (ILP) to solve the 
constrained problem as well as an approximate algo- 
rithm with a proven approximation ratio (Section ITVk 
we also present a distributed version of the approximate 
algorithm (Section ITVT l. 

4) We use both simulations and experiments to evaluate our 
algorithms (Section [vTh . 

We end this introduction with a simple example to illustrate 
that different computational objectives will have different 
optimal communication structures. We compare the optimal 
routing for data compression, and for computing SVD. As- 
sume that compression converts 2 input streams of size R bits 
each to an output stream of R + r where r < R 1 19). The 
SVD operator, as discussed in detail in Section HVl converts k 
input streams of size R bits each, into k eigenvectors of size 
r bits each with r < R. Consider the simple 4-node topology 
of Figure Q] and the two possible communication structures, 
with node being the base station and assume all links 
are of unit length/cost. As derived in 11191 . data compression 
requires an exchange of 3R + 3r (using successive encoding) 
and 4R + r bits respectively for the communication structures 
(a) and (b). Hence, if R > 2r, then (a) is better. On the other 
hand, in the case of SVD if we do not perform in-network 
computation, then sending all raw data to node results in a 
cost of 6R and 5R over the two structures, respectively. If we 
perform in-network computation, then as detailed in Section 
IIVI the resulting costs are 3R + 6r and 3R + 3r for the two 
structures, respectively. Hence (b) is always better for the SVD 
computation. 

II. Problem Formulation and Preliminaries 

In this section, we first introduce the relevant background on 
structural health monitoring, then present the network model 
and formally introduce the problem. 
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(a) (b) 

Fig. 1 . In-network computation and compressed sensing can have a different 
optimal communication structure, (a) and (b) represent the two possible 
communication structures for a simple 4-node topology. 

A. Background on Structural Health Monitoring 

During the past two decades, the SHM community has 
become increasingly focused on the use of the structural 
vibration data to identify degradation or damage within struc- 
tural systems. The first step in determining if the vibration 
data collected by a set of sensors represents a healthy or an 
unhealthy structure is to decompose the spectral density matrix 
into a set of single-degree-of-freedom systems. Assuming a 
broadband white input to the system, this can be accomplished 
by first obtaining an estimate of the output power spectral 
density (PSD) matrix for each discrete frequency by creating 
an array of frequency response functions using the Fast Fourier 
Transform (FFT) information from each degree of freedom. 
Early studies in this field focused on identifying changes 
in modal frequencies or the eigenvalues of the PSD matrix 
using the peak picking method ll20l to detect damage in large 
structural systems |2"T1 . More recent studies have observed 
that viewing changes in modal frequencies in combination 
with changes in mode shape information (eigenvector of the 
PSD matrix) makes it increasingly possible to both detect and 
locate damage within a variety of structural types and config- 
urations ifTOl - irTSl . One of the most widely used method for 
mode shape estimation is the frequency domain decomposition 
(FDD) method proposed by Brincker et al. |f22l . This method 
involves computing the SVE0 of the PSD matrix to extract the 
eigenvectors/mode shapes. 

The most common implementation of the FDD method 
over a wireless sensor network is to have each sensor send 
its vibration data to a central sensor node which computes 
the SVD of the PSD matrix and then distributes the mode 
shapes back to each sensor. This method requires significant 
computational power and memory at the central node as 
well as a significant energy consumption in the network to 
communicate all this data to the central node. For example, if 
there are 100 sensor nodes in the network, this implementation 
requires the central sensor node to compute the SVD of a 
100 x 100 PSD matrix as well as having each of the 100 sensor 
nodes send all their vibration data to one central node. 

Zimmerman et al. (T3l proposed an alternative implemen- 
tation by decomposing the computation of SVD (graphically 
represented in Figure |2]). Each sensor node is assumed to 
be aware of the eigenvalues of the PSD matrix (which have 

'Please see Appendix lAl for a description of the SVD computation. 
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been determined using the peak-picking method Q) and the 
FFT of its own sensed data stream. Denoting the entire set 
of nodes as V, if a sensor has the FFT of JV C V, \N\ > 1 
sensors and all the eigenvalues, then it can compute the SVD 
of the PSD matrix using \N\ sets of FFT results and determine 
\N\ eigenvectors. Let another sensor node be in possession of 
the FFT of N' C V, \N' > 1 sensors. It can perform a similar 
computation to determine \N'\ eigenvectors. To be able to 
combine results from these two computations to construct the 
\NUN'\ eigenvectors, one needs to be able to determine the 
appropriate scaling factors. This notion is precisely given in 
the following. 

Definition 1: Two computations are called combinable if 
one can determine the appropriate scaling factors to combine 
them. A computation on N nodes and another computation on 
N' nodes is combinable if and only if either NON' ^ (j) (that 
is, there is at least one common sensor in N and AO, or there 
exists another computation on N" nodes which is combinable 
with both N and N'. 
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Fig. 2. Decomposing the computation of SVD using in-network computation. 

If R denotes the size in bits required to represent the FFT 
of a sensor stream and r denotes the size in bits to represent a 
eigenvector, each SVD computation which combines the FFT 
of k sensor streams reduces the number of bits from IcR to kr. 
Note that the size of the output stream does not depend on R 
but only on r, which depends on the size of the network. 

B. A Generalized Clustering Problem 

With the above decomposition, it can be seen that the asso- 
ciated communication problem can be cast as a generalized 
clustering problem: the solution lies in determining which 
subset of sensors {cluster) should send their FFTs to which 
common node {cluster head), who then computes the SVD 
for this subset, such that these subsets have the proper overlap 
to allow individual SVDs to be scaled and combined. We 
consider this clustering problem generalized because due to 
the combinability requirement clusters will need to overlap, 
e.g., the cluster head of one cluster can also be the member 
of another cluster. The resulting hierarchy is thus driven by 
this requirement rather than pre-specified as in many other 
clustering approaches (e.g., the two-layer clustering in LEACH 

2 Finding the optimal communication structure to implement peak-picking 
in a distributed manner turns out to be the same as in the case of data 
compression 1191 . 1231 . 1241 . and is thus not studied here. 



|25l ). However, should this requirement be removed then the 
clustering problem becomes a fairly standard one. 

C. Network Model and Problem Definition 

We will proceed to assume a network model of an undi- 
rected graph denoted as G(V,E). Each (sensor) node in V acts 
as both a sensor and a relay. If two nodes can successfully 
exchange messages directly with each other, there exists an 
edge e G E between them. Let there be a weight w e > 
associated with each edge which denotes the energy expended 
in sending a packet across this edge. Without loss of generality, 
we take node to be the central node or the base station. 
We also assume that all sensors (including the base station) 
are identical in their radio capability (and hence have the 
same energy consumption per bit). This is done to keep the 
presentation simple and can be easily relaxed. 

Each node has a local input vibration stream. The goal is 
to evaluate the SVD of the PSD matrix formed by the input 
vibration streams of all the sensors. We define a sensing cycle 
to be the time duration in which each sensor performs the 
sensing task to generate a vibration stream, the SVD is then 
computed and the mode shapes are made known at the base 
station. The length of this cycle as well as how this procedure 
may be used in practice are discussed in Section [V] Our 
objective is to determine the optimal communication structure 
to minimize the energy consumption in a sensing cycle under 
a constraint on the maximum duration of a sensing cycle. 

The two metrics of interest, namely energy consumption 
and computational delay are precisely defined as follows. 

Energy consumption is defined as the total communication 
energy consumed in the network in one sensing cycle. Let 
Ej x and Er x denote the energy consumed in transmitting and 
receiving a bit of data. For convenience we will denote E/, = 
Ejx+Erx- Then the energy consumed in transmitting a packet 
of B bits over an edge e is w e BEy 0. 

Computational delay is defined as the time it takes to 
compute a designated function at a sensor node. As observed 
in fOl , 11261 , the computational time is the chief contributor to 
delay as packet sizes in sensor systems tend to be very small. 
Thus, the duration of a sensing cycle depends primarily on the 
maximum computational delay amongst all sensor nodes. 

A constraint on the computational delay essentially trans- 
lates into a constraint on the maximum number of FFT's which 
can be combined at a node. We now formally introduce the 
problem. 

Problem PI: Find (1) the set S of sensor nodes on which 
the SVD computation will take place, (2) for each s G S, their 
corresponding set N s of sensor nodes whose FFT will be made 
available at s, and (3) a routing structure, so as to minimize 
E, the total energy consumed, subject to the constraint that 
\N S | < n s , V* G S, where n s denotes the maximum cluster size 
allowed at node s that corresponds to its computational delay 
constraint, and that the computations on all pairs s\ , S2 G S are 
combinable. 

3 Our algorithms and analysis do not depend on the exact model used for 
energy consumption, provided that it remains a function of the number of bits 
transmitted; more is discussed in Section Fvl 



4 



The set S will also be referred to as the set of cluster 
heads, and set N s the cluster associated with head node s. 
In the above description we have imposed individual delay 
constraints. Note that the computational delay of one round 
of SVD is dominated by the largest delay among all nodes 
if the computations of successive rounds are pipelined. One 
could also try to minimize the maximum computational delay 
with a constraint on the energy consumption. Indeed, it can 
be shown that the dual of the linear programs we propose will 
optimally solve this alternative formulation. 

A decision version of PI can be shown to be NP-hard 
through a reduction from set cover. The proof is given in 
Appendix [B] 

Theorem 1: There is no polynomial time algorithm that 
solves PI, unless P = NP. 

III. A Lower Bound on the Value of PI 

To simplify presentation, in this section we will assume 
that the weights of all edges are equal. This is not restrictive 
as all bounds derived in this section can be easily modified to 
incorporate different weights. With this assumption, the energy 
consumed in sending data from one node to another merely 
depends on the number of hops on the path between them. 

Definition 2: A data collection tree (DCT) for G(V,E) is 
the spanning tree such that the path from each node v G V to 
the base station has the minimum weight. 

Compared to a minimum spanning tree (MST), a DCT offers 
minimum weight on each path to the root rather than over the 
entire tree. Since all weights are equal, a path of minimum 
weight is equivalent to that of minimum hop count. Let do{v) 
denote the hop count of node v € V in the DCT. 

The following lemma provides a lower bound on the mini- 
mum energy consumption for PI given any choice of S. 

Lemma 1: Consider PI defined on graph G(V,E), and a 
set of cluster heads S ^ <p, then a lower bound on the optimal 
energy consumption, denoted by E(S), is given by 

E(S)> ^(|V|-l)*+£(d (v)-l)r+|%W (1) 

Proof: For all nodes v 6 V\S, a message of size R 
(containing the FFT) needs to be transmitted from v to some 
node in S. This message goes over at least one hop to reach this 
node, after which the size reduces to r (the eigenvector). Since 
the minimum hop count from v to the base station is do(y), if 
the message of size R goes over one hop, the message of size 
r will go over at least do(v) — 1 hops. As R > r, the amount of 
transmission required includes |V| — |5| transmissions of size 
R and Y,veV\s(do( v ) ~ 1) transmissions of size r. 

In addition to the above, each of the \S\ computations 
needs to be combinable. This means that Vsi,S2 € S, either 
N SI HN S2 ^ or there exists another node S3 6 S such that the 
computations at nodes s\ and S3, as well as that at nodes S2 and 
S3 are combinable, respectively. To understand how many extra 
messages are needed to satisfy this constraint, we construct the 
following graph G S (S,E S ): an edge is added to E s between 
nodes si and S2, si,S2 G S, only if N Sl (1N S1 ^ (j>. Each edge in 
this graph thus represents at least one common node between 



N SI and N S2 ; each common node needs to transmit a message 
of size R to both si and S2. 

It follows that each edge implies at least one extra transmis- 
sion of size R in addition to the |V| — \S\ transmissions of size 
R computed earlier. Two nodes in si,si e S are combinable if 
and only if there exists a path between si and S2 in G S (S,E S ). 
For a path to exist between every pair of nodes, G S (S,E S ) 
needs to have at least \S\ — 1 edges. This means at least \S\ — 1 
extra transmissions of size R are required for all pairs si,S2 E S 
to be combinable. Taking this into account, at least \V\ — 1 
transmissions of size R and Lvev\s(^o(v) — 1) transmissions 
of size r have to take place. 

Finally, computed eigenvectors from nodes v G S each goes 
through at least do(v) hops. Combining all of the above yields 

E(S) > [(|V|-1)J?+ £ (rf (v)-l)r+£rf (v)r £ fc 

\ veV\S veS J 

= [(|V|-l)*+£(£fo(v)-l)r+|S|rj£ft . (2) 



An interesting observation is that the lower bound only 
depends on the size of S and not its membership. One way to 
get close to this bound is to limit the delivery of any FFT to a 
single hop and route the FFT and the subsequent eigenvector 
along shortest paths. This motivates a particular solution for 
any given tree structure. 

Definition 3: Consider a graph G(V,E) and a routing tree 
T. Define a communication structure Ap2(T) as follows: (1) 
All non-leaf nodes in T constitute the set S, (2) cluster N s ,s G S 
consists of all immediate children of s G S, and (3) each node 
sends its own FFT to its parent node on T, and a node s G S 
sends eigenvectors for itself as well as its children along T to 
the base station. This will be referred to as tree solution T. 

We next consider an unconstrained version of PI, i.e., by 
removing the computational delay constraint. We refer to this 
unconstrained problem as P2. 

Lemma 2: Consider P2 defined on a graph G(V,E), and a 
routing tree denoted by T defined on the same graph. Let dj (v) 
denote the hop count of node v G V in T. Then tree solution 
Ap2(T) is feasible and has an energy consumption 



E AP2 (T)=[ (\V\ 



l)R+Y,(d T (v)-l)r+\S\r )E b . (3) 

veV I 



Proof: We first show feasibility, i.e., each pair of nodes 
Si,S2 G S are combinable. Since S consists of all non-leaf nodes 
on a tree, there exists a path between any pair of these nodes. 
Thus Ap2(T) is feasible. 

Next since each node (except for the base station) sends its 
FFT to its parent, this results in a cost of (|V| — l)REbl each 
non-leaf node computes the SVD from its children's FFT and 
its own, and then sends the eigenvectors to the base station, 
resulting in £ v€ y (dr(v) — l)r + \S\r bits. Putting everything 
together yields the lemma. ■ 

This lemma suggests that of all solutions given by a tree 
structure, the one that minimizes both dr(v) and \S\ will 
result in the smallest energy consumption. This motivates the 
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construction of a DCT (which minimizes dr(v)) that has a 
minimum number of non-leaf nodes (which minimizes |5|). 

Definition 4: A minimum non-leaf node data collection 
tree, or MDCT, defined on graph G(V,E) is a DCT that has the 
smallest number of non-leaf nodes among all DCTs defined 
on G(V,E). We will denote this tree as Tm- 

A key property of an MDCT is that it is impossible to move 
all the children of wa non-leaf node v G V on Tm to other non- 
leaf nodes of height < dj{v). This is because if this could 
be done then we can effectively reduce the number of non- 
leaf nodes on Tm, which is a contradiction. Figure [3] gives an 
example: both (a) and (b) are DCTs on the same graph, but 
the former is not a MDCT while the latter is. 




(a) Number of non-leaf nodes = 4 (b) Number of non-leaf nodes = 3 

Fig. 3. Two data collection trees for the same network. The solid lines 
represent the edges of the tree. Tm is the tree in (b). 

Theorem 2: Consider P2 defined on G(V,E), and an asso- 
ciated MDCT Tm with cluster head set S. Under the condition 
R > 2r, an optimal solution to P2 is given by Ap2(5*f)0 

Proof: From Lemma [2] we know that Ea p7 (Tm) matches 
exactly the lower bound given in (|2j. Consider any other 
solution with a cluster head set S' such that \S'\ > \S\. By 
Lemma Q] E(S') > Ea p1 (T) so any solution with a larger set 
S' is no better. 

Consider next any solution with a set S" such that \S"\ < \S\. 
By Lemma Q] using S" instead of S reduces the energy by no 
more than (|5| — \S"\)rEb. On the other hand, consider a node 
v G S and v £ S". By the property of the MDCT T M , there are 
only three possibilities in how the children of v can send their 
FFT under the new solution S": (1) each child of v sends its 
FFT to some node v" G S" with do(v") = do(v) via a single 
hop; (2) at least one child of v sends its FFT to a v" G S" 
with do(v") > do(v) via a single hop; (3) at least one child 
of v sends its FFT over at least d > 2 hops to reach v" G S" 
with do(v") +d > do(v) + 1. Denote these sets as Vi, V% and 
V 3 , respectively. Note that \S\ = \SDS"\ + \Vi\ + \V 2 UV 3 \. 

In case (1), at least one such v" G S" cannot be in S, and 
these v" nodes will be distinct for different v G S, $ S" nodes, 
for otherwise it contradicts the definition of an MDCT. Thus 
for each such v G S, $ S" there corresponds a v" G S",$. S. 
Therefore case (1) does not contribute to any reduction in 
energy consumption compare solution S" to S. Thus, \S"\ < \S\ 
can only be true if either (2) or (3) is true for some v G S, £ S"; 
in other words, \S\ — \S"\ = V2UV3I. For each such v, if it falls 
under case (2) then there is an energy increase (from S to S") 
of at least rE/, due to the height increase of v" over v; if it 
falls under case (3), the energy increase is at least (R — r)Eb > 

4 The condition R > 1r is easily satisfied in SVD computation for SHM. 



rEj, by the condition stated in the theorem. Thus the total 
energy increase is at least rEj, for each v G V2UV3; therefore 
the increase is at least (\S\ — \S"\)r. Hence any solution with 
a smaller set S" is no better, completing the proof. ■ 
To summarize, an MDCT yields the optimal solution for P2, 
which also serves as a lower bound to the value of PI. Note 
that in this solution the overlap between clusters is through 
cluster heads; all cluster heads (except for the base station) is 
a member of another cluster. 

IV. Exact and Approximate Algorithms 

We next present an integer linear program (ILP) to solve PI 
exactly and a O (log (\V\)) approximation algorithm for PL 

A. An Exact ILP for PI 

We first introduce optimization variables used in the ILP. 

The set of variables Xu,i,j G V define both the sets S and 
N s ,Vs G S as follows. Xjj := 1 if the FFT of node i is evaluated 
at node j (i.e. ; G Nj), and otherwise, xu := 1 if i G S and 
otherwise. 

Pijk := 1 if the FFT of node k is evaluated at both nodes i 
and j. This notation is used for convenience of presentation 
only as it is completely determined by Xij,i,j G V. 

Finally, the variables c, 7 „ recursively verifies the combin- 
ability relationship between two nodes i, j G S as follows: 

!1 if n = and ZkevPijk > L 
1 ifO<n<|V| and 
LkeV c ik(n-l)- c jk(n-l) + c ij(n-l) — L 
otherwise. 

Thus cm = 1 if the pair i,j G S share common nodes in their 
respective clusters, cu\ — 1 if the pair i,j either share common 
nodes directly or each shares common nodes with a common 
third cluster, and so on. If the pair i, j G S are combinable, we 
will have cy(iyi_i) = 1. 

Finally, Wjj denotes the sum weight of all edges along the 
shortest path from node i to node j. 

The ILP below solves PI exactly, where the minimization 



is over the choice of Xij,Vi,j G V. 

(ILP_P1) ramZwj&xijE^mj + rWjo) (5) 
s.t. 

£/eV,#i ir < x n < Ljev,#i x ji » Vi e V (6) 

LjevXij> l,ViGV (7) 

p .. k < x Ji±*iyij )keV (8 ) 

Cijo < Lkev Pijk , Vi, j G V (9) 

c i7(|V|-i)) > x a + x .jj - IjViJ G V (10) 

tijkn < Cik(n - 1)+ 2 CMn - l) ,Vi,j,k G V,0 < n < \V\ (11) 

Cijn < Cy( B -l)+Lfc£V^7*(n-l)>Vi,j G V*,0 < It < \V\ (12) 

c iin = 0,\fieV,0<n< \V\ (13) 
LieV x ij 

<nj,Vj€V (14) 

Xij,Cijk,Pijk,tijkn e {0, l}Vi, G V,0 <n< \V\ (15) 

The objective (Eqn (|5]l) is fairly straightforward: if the FFT 



of node i is sent to node j, it costs RWijEj,. The FFT from i 
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produces a unique eigenvector of size r at node j as a result 
of this SVD computation, which costs rWjoEt, to send to the 
base station. 

The first constraint (Eqn ©) sets the value of xu to 1 if Ni ^ 
<p, and otherwise (note that if Ni^ 0, then 1 < Y.jev.j^i x ji < 
\V |). Eqn (|7]i ensures that the FFT of every sensor node is sent 
to at least one node. Eqn ((8) ensures that = 1 if the FFT 
from node k is sent to both nodes i and j. 

The next five constraints ensure the combinability of the 
solution by limiting the value of c,;„. Eqn (O ensures that 
CjjQ = 1 if there is at least one node common to N, and Nj. 
Eqn ( [TOt states that if both i, j G S, the computations at i and j 
should be combinable. Eqns (fTTT i and ( fT2b populate the value 
of Cij n . Note that f i; x„ is a temporary variable introduced to 
express the quadratic condition in Equation j4) as a linear 
function. Note that the presence of Eqn ( fTOb forces Eqns (O, 
©, ( fTTT i. and ([T2l to assign the maximum possible value to 
the LHS; similarly, the presence of the latter forces ( [Tol l to 
assign the minimum possible value to the LHS. 

Eqn (Tl~3T > sets the value of c,,„ to zero for every i G V, < n < 
V |. This prohibits a corner case where c,j„ is set to 1 by setting 
c U(n-i) to 1 without ensuring that the computation at ; and j 
are combinable. Finally, Eqn < TT~4T > imposes the computational 
delay constraint at each sensor node. 

B. Degree-Constrained DCT: Problem P3 

In this and the next two subsections we will develop a 
O (log (|V|)) approximation to the optimal solution of PI. To 
simplify the presentation, we will again assume that all edge 
weights are equal. Note that all the algorithms proposed in 
this section can be easily modified without changing their 
approximation factors to incorporate different weights. 

The basic idea is to first use a DCT to find a feasible solution 
to PI. A feasible solution requires that each cluster is size- 
limited due to the computational delay constraint: a node v 
cannot have more than n v — 1 immediate children. This leads 
to the following definition. 

Definition 5: A degree-constrained data collection tree, or 
DDCT, is a tree T which minimizes Y,vevdr{v) under the 
constraint that a node v G V has no more than n v — 1 immediate 
children, where « V) Vv G V are given constants. 

Problem P3: Find a DDCT for G(V,E), which in turn 
determines the set S, clusters N s ,Vs G S, and the routing 
structure. 

That a solution to P3 is feasible for PI is obvious, but it 
may not be optimal for PI, even if it has the fewest non-leaf 
nodes among all DDCTs because a node may no longer be on 
its shortest path. 

It's worth noting that P3 is also NP-hard; it is APX-hard 
even when weights on edges satisfy the triangle inequal- 
ity |27l . Results on P3 are known only for complete graphs 
whose weights satisfy the triangle inequality 11271 . l28l . To the 
best of our knowledge our work here is the first to propose 
algorithms with proven approximation ratios for P3 in graphs 
induced by a communication network. 

We proceed as follows. We first present an ILP (ILP_P3) 
to solve P3 exactly. This ILP has much fewer variables and 



constraints than ILP_P1, and hence takes less time to solve. We 
then relax ILP_P3 to an LP and solve it via appropriate round- 
ing of fractional values. This rounding algorithm, referred to as 
algorithm LPR, is thus an approximation algorithm for P3, and 
therefore also an approximation algorithm for PI. We derive 
the approximation factor for LPR with respect to problem PI 
in Theorem[3] Finally, based on the intuition derived while an- 
alyzing LPR, we present a simpler, distributed approximation 
algorithm with the same asymptotic approximation factor. 

C. An ILP for Problem P3 

We define the following variables used in the ILP in finding 
a DDCT. For a given G(V,E), define a graph G(V,E) with 
directed edges, by replacing each undirected edge in E with 
two directed edges, one in each direction. Let O v ,v G V denote 
the set of outgoing edges from node v in E. Similarly, let 
7 v ,v G V denote the set of incoming edges into node v in E. 

The set of variables x e ,e G E define whether an edge is on 
the DCT as follows. x e := 1 if edge e is on the DCT, and 
otherwise. The variable f e ,e G E will be referred to as the flow 
value over the edge e; it denotes the number of nodes using 
edge e to reach the base station on the DCT. f e = if edge e 
is not on the DCT. 

The following ILP solves P3 exactly, where the minimiza- 
tion is over the choice of x ei Me G E. 



(ILP_P3) min £ eefi / e (16) 

Le £ I fe-LeeO fe = \V\-l (17) 

Leelje-Leeoje = -l,Vv G V\{0} (18) 

fe<(\V\-l)x e ,VeeE (19) 

LeeEXe = \V\-l (20) 

Z ee o v Xe = hVv€V\{0} (21) 

Leei r Xe<n v -l,\fveV (22) 

x e G {0, l},Ve G -E (23) 

/ e e{0,l,...,|V|-l},Vee£ (24) 



The objective function minimizes the total flow, which essen- 
tially minimizes Lv6V^r(v). 

The first two constraints ensure that each node sends a unit 
flow towards the base station. The third constraint forces f e 
to be if x e is 0, otherwise, it is redundant. Eqn d20l ) ensures 
that the output has exactly V| — 1 edges. Eqns d2Tb and ( l22l ) 
ensure that there is no more than one outgoing edge per vertex 
(other than the base station) and no more than n v — 1 incoming 
edges into vertex v. Eqns (1201 and ( T^TT i together ensure that 
the output is a tree and Eqn (|22| ) ensures that a node v has no 
more than — 1 immediate children. 

D. Algorithm LPR: an LP Rounding Approximation 

We next present a polynomial-time approximation algorithm 
which relaxes ILP_P3 to a linear program (LP), by allowing 
< x e < 1 and f e > to be fractional and appropriately 
rounding the fractional values. This algorithm is referred to 
as LPR and shown in Figure [4] 
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NV = {Q}, NE = (f>, h = 0, assign h v = -1, Vv e V\{0} and 
h = 

while (NV\ = V) do 
h = h + 1 

Solve the ILP for P3 with fractional variables 
and the additional constraint that x e — \.\/e^NE 
For Vv 6 NV and h r = h-\ 

If the value of x e for more than (n v — 1) 
incoming edges at v is greater than 
Set the largest (n v — 1) x e values 
amongst the incoming edges at v to 1 
(ties are broken arbitrarily) 
Otherwise 

Set the x e value of all incoming 
edges at V to 1 
Add the edges for which x e was set to 1 
in the previous step to NE 
For all edges added to NE in the 
previous step, add the node v from 
which the edge emanates to NV and 
assign h v — h 

Fig. 4. Algorithm LPR: The LP rounding approximation algorithm for P3. 
E. The Approximation Factor of LPR 

Even though LPR makes no assumptions on the network, 
our derivation of the approximation factor assumes the follow- 
ing: (1) «„„■„ > 3, where n m ,„ = min vfE yra v , (2) the unconstrained 
MDCT constructed over G(V,E) has a height of O (log(|V |)), 
and (3) nodes can transmit to each other if the distance 
between them is less than a transmission range R tx . 

To understand (1), note that if = 2,Vv £ V, then each 
node has at most 1 child and the constructed tree is thus linear 
(a chain). The problem subsequently reduces to the traveling 
salesman problem. Similarly, if most nodes disallow more than 
1 child, the resulting tree will be close to linear, which is not 
a very interesting routing structure to study. Finally, and more 
importantly, most existing sensor platforms have sufficient 
computational power to quickly combine FFT's from at least 
3 nodes, easily satisfying this assumption. Assumption (2) is 
also easily satisfied as sensor networks used for SHM are in 
general not very sparse. Assumption (3) is very commonly 
adopted for analytical tractability. However, our analysis is 
not heavily dependent on this assumption (more is discussed 
in the footnote in the proof of Lemma |4|i and the same 
approximation factor also holds under more realistic physical 
layer assumptions. 

We next derive the approximation factor of LPR with respect 
to PI. The analysis is based on the observation that the 
approximation factor is essentially the ratio between the height 
of the DDCT constructed using LPR and the height of the 
MDCT (discussed in more detail in the proof of Theorem |3J. 

Denote the height of the MDCT by h or i g and the height of 
the DDCT generated by algorithm LPR hddct- Define a non- 
full node to be a node v at height h < hddct which has less 
than 7i r — 1 children. A height 1 < h < hddct is defined to be 
a non-full height if there exists at least one non-full node at 
height h. We then have the following lemma. 

Lemma 3: Consider running algorithm LPR on a set of m 
nodes with a randomly selected base station and a topology 
such that the maximum set of nodes that cannot transmit to 
each other has a size p. Then the resulting DDCT cannot have 
more than p non-full heights. 

Proof: We prove this by contradiction. Let there be p + 1 
non-full heights: h\ < ... < h p+ \. Let v, be a non-full node at 



height hi, I < i < p + 1. Then, Vi,vj, 1 < i < j < h p+ \ cannot 
transmit to each other, for otherwise LPR would have labeled 
Vj as the child of V/. Thus none of the nodes vi,...,Vp+i can 
transmit to each other. However, by assumption we cannot 
have more than p nodes which cannot transmit to each other, 
thus a contradiction. ■ 
Lemma 4: Under the assumption that the height of the 
MDCT h orig = 0(log(|VQ), the height of the DDCT con- 
structed by LPR is h ddct = © (log(|V |)). 

Proof: By the construction of the MDCT, the maximum 
distance of a node from the base station is h or j g R t ^. Using 
geometric arguments similar to the ones used in PP . it's easy 
to show that the set of nodes none of which can transmit 
to each other has a size of no more than r- < 



COS 1 ^ 


1 ■> \ , ,) 




2c 2 log 2 (|K|) J 



~ 27Tclog(|V I), for some constant c, where 



the equality follows from the small angle approximation 

COs(x) Wl-y. 

Thus by Lemma[3] there are no more than 27Tclog(|V|) non- 
full heights. At the same time, the number of full heights is 
0(log(|V|)) by definition. Hence h ddct = @(log(|V|)). ■ 

Theorem 3: The approximation factor of LPR is 
O0og(|V|)). 

Proof: To derive the approximation factor, we compare 
the energy consumed in the DDCT constructed using LPR 
(given by Lemma O to the lower bound on the optimal 
solution of PI (given in Lemma [TJ. First, we note that 
\S\ > (^7r~~) m tne optimal solution and \S\ = c\ (tt" - ) m 
the DDCT (as h dc ict = © (log(|V|))) where c\ is a positive 
constant, n max = max ve y« v and «,„;„ = min ve y« v . Thus, the 

. , . £vev(^c/(v)-l)+ci(/-) , . | | . 

approximation factor is < ; — / V "\" < l°g( W LK 

I,ev(d (v)-l) + ( i — ) 

where d t idcr(v) denotes the hop count of node v in the DDCT. 
The final inequality holds because h org < C2log(|V|) and 
hddct = C3 log( | V | ), for some positive constants C2 and C3. 
Hence the approximation factor is O (log(|V |)). ■ 

F. A Distributed Approximation Algorithm (DAA) 

The approximation algorithm LPR is centralized as it re- 
quires solving an LP globally. We now present a simpler, 
distributed algorithm with the same asymptotic approximation 
factor. 

The proof of Lemma [3] uses the following observation from 
LPR: at height if there exists a node v with more than n v — 1 
neighbors which are not yet a part of the tree, the algorithm 
will add n v — 1 children to it. Otherwise, all its neighbors not 
yet a part of the tree will be added as its children. 

Using this intuition, we propose a modified Dijkstra's 
shortest path algorithm DAA in Figure [5] This algorithm 

5 Note that due to fading effects, the transmission range may not be a 
constant. However, there will always exist distances So and Ri such that 
if two nodes are within So of each other, they can transmit to each other with 
negligible loss, and if they are more than R\ apart, they cannot exchange 
packets with each other 1291 . 1301 . Rq and Ri may be much smaller and 
larger respectively than the actual transmission range; replacing R tx by these 
constants appropriately allows the same argument to go through for a more 
general physical layer model. 
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satisfies the observation made in the previous paragraph, hence 
Lemma[3]holds, and so do Lemma|4]and Theorem[3] Thus, the 
approximation factor for DAA is also 0(log(|V|)). The tree 
is built top down from the root with each node v choosing its 
n v — 1 children arbitrarily. Hence, like any shortest path algo- 
rithm l32l it can be built by message exchanges only between 
neighboring nodes. We will compare this modified Dijkstra's 
algorithm with LPR through simulation in Section [VI] 

NV = {0} r /i,, = «. VveV\{0}, /i = 0, C„ = 0, VveV. 

(C v denotes the number of children of node v.) 
while (NV. = V) do 

For each edge e^E such that e connects 

nodes V e NV and v' £ V\NV and C„ <n v - 1 
h[, =min{h[.,h v + \) 

Vmin = argmin,,{/),. | Vv £ V\NV} 

Add v„„„ to JW. 

Let the parent of Vmin be Vparent • Update 

Set /j v = ~, VveV\NV 
Fig. 5. Algorithm DAA: Modified Dijkstra's algorithm for P3. 



V. Discussion 

In this section we discuss a number of ways to relax 
the assumptions used earlier, as well as the applicability of 
distributed SVD computation in practice. 

The energy model presented in Section IH-CI is rather sim- 
plistic; it does not capture energy expended in overhearing 
etc. However, as long as the energy model is a linear function 
of the amount of bits transmitted per node (most energy 
consumption models fit this characterization), the proposed 
algorithms can be directly applied without any change in their 
optimality or approximation factors. 

The model and algorithms presented here can also be 
easily extended to include additional constraints, including 
accuracy and storage. In our decentralized SVD computation, 
the eigenvectors are determined by linearly combining those 
computed locally at different sensor nodes. If the sensors are 
noiseless, then the eigenvectors computed using this decom- 
position will exactly match the actual eigenvectors. However, 
the presence of noise in the sensed values can lead to errors 
in the computation l33l . This is because in a centralized 
implementation, a least-squares effect minimizes the error due 
to noise across all eigenvectors, whereas the decentralized 
implementation allows this error to accumulate through each 
combination of locally computed eigenvectors. 

The larger the number of FFT's being combined at each 
sensor node, the smaller this error. Hence a desired accu- 
racy will impose a constraint on the minimum cluster size 
\N S , s 6 S. This is the opposite of the delay constraint, and 
incorporating it in our models is quite straightforward. Denote 
this constraint by n a , i.e., \N S \ > « a ,Vs G S. Then in ILP_P1, 
the following constraint is added: Y l iev x ij > n aXjj>Vj £ V. 
Similarly, in ILP_P3, we add (1) ^eei v x e > ("« - l)/ v ,Vv € V, 
where variable /,, E {0, 1 } is set to 1 if v is a non-leaf node, and 
(2) Lee/,, ^e/ 1^1 < h < Heei r x e, ^ v <= ^> to ensure that l v is set 
1 only if v is a non-leaf node. Finally, the two approximation 
algorithms, LPR and DAA, can both be easily modified to 
maintain the number of children of each node in the data 



collection to be greater than n a — 1 . The effect of an added 
accuracy constraint will be examined in numerical studies 
presented in Section [VI] 

As the number of FFT's being computed at a node increases, 
not only the delay but also the storage required increases l33l . 
A storage constraint acts in a way very similar to the de- 
lay constraint: it essentially bounds the maximum number 
of FFT's that can be combined at a sensor. Therefore to 
incorporate this constraint we simply need to upper bound 
the value of |jV s | to be the lesser of the two, which results in 
an identical problem. 

While our discussion has centered solely on the computa- 
tional task of SVD, our approach is more generally applicable. 
Once a computational task is represented as a set of operators 
with associated input and output dependencies, one can use 
a very similar approach to seek the optimal communication 
structure, i.e., on which node to place which operator and 
along what path to send input to that node, etc. The resulting 
math program will in general be problem specific, but the 
solution philosophy is common; see Appendix ICl for a similar 
approach to decomposing the global operation of simulated 
annealing over a network of sensor nodes. 

While the proposed SVD computation can run continuously 
as a stream process, in practice it suffices to schedule it several 
times a day, each lasting on the order of minutes (the actual 
duration of the sensing cycle depends on the size of the FFT 
and the computation capacity of the sensors), as one does not 
in general expect mode shapes of a structure to change rapidly 
over time. Even though the task is performed infrequently, 
the saving in each operation is indeed significant (see results 
in Section IVII ). and the accumulated effect are undoubtedly 
beneficial for a sensor network to have a lifetime on the order 
of months or years. 

One weakness common to most in-network processing 
methods is that they typically deliver summaries or features 
of data rather than raw data itself; thus we potentially lose the 
ability to store and post-analyze the data (e.g., for an entirely 
different purpose than originally intended). In this sense, this 
type of operation is most advantageous when used in a real- 
time setting concerning instantaneous detection and diagnosis. 
For instance, a human inspector can use this approach (i.e., 
activate this SVD operation) to quickly check the mode shapes 
of a structure before deciding whether and what more (manual) 
inspection is needed. 

VI. Simulation and Experimentation 

We use both simulation and experimentation on a real 
sensor platform to evaluate the performance of the proposed 
algorithms. For simulation we use CPLEX 1341 to solve the 
ILPs, and all simulations are done on topologies generated 
by randomly distributing nodes in an area of 50 x 50m 2 and 
assuming the transmission range to be 30m. For the SVD 
computation, we use R = 8192 bytes and r = 32 bytes fl3l . 
We also assume that the computational delay constraint is the 
same for all nodes and n v = n,Vv 6 V. 

We first examine the effect of delay constraint n on energy 
consumption. Figures |6(a)| and |6(b)| compare the number of 
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(a) 



(b) 



(c) 






(d) 



(e) 



(f) 




(g) 



Lower Bound 
-e- ILP_P3 (n = 6) 
-B- DAA (n = 6) 

O ILP_P3 (n = 81 
DAA (n - 8) 




(h) 




(i) 



Fig. 6. Simulation Results, (a) |V| =4 (24576). (b) |V| = 6 (40960). (c) |V| = 10 (163840). (d) |V| = 30 (573440). (e) \V\ = 100 (1359872). (f) \V\ = 200 
(2342912). The number in brackets denotes the number of bytes transmitted in the network without in-network computation. Simulation Results with an 
accuracy constraint, (g) |V| = 5,n = 5. (h) |V| =30. (i) |V| = 200. 



bytes transmitted under the lower bound (Lemma[T|i, using the 
optimal communication structures derived by solving ILP_P1 
(Section lIV-Ah and using the three approximation algorithms 
ILP_P3 (Section HV-Cl LPR (Figure gji, and DAA (Figure ©, 
for different values of «, with [V [ = 4 and |V| = 6 respectively. 

We observe that the approximation algorithms perform 
very close to the optimal. It takes more than one hour of 
computation to solve the ILP_P1 for |V| > 6 on a 2.99 GHz 
machine with 4 GB of RAM. Hence for larger values of |V| 
we only compare the three approximation algorithms against 
the lower bound, shown in Figures |6(c)| and |6(d)| We note that 
(i) all approximation algorithms are within 3% of the optimal, 
and (ii) DAA outperforms LPR. These results also demonstrate 
the advantage of using the ILP_P3 over ILP_P1; it runs much 
faster and converges within an hour up to |V| =40. 

For even larger values of |V|, we compare the performance 
of DAA (as it consistently outperforms LPR) against the lower 
bound in Figures |6(e)| and |6(f)| We observe that it is always 
within 3% of the optimal. These results clearly demonstrate the 
advantage of in-network computation as the number of bytes 
transmitted over the network are reduced by more than half. 
Finally, Figures |6(e)| and |6(f)| also show the trade-off between 



communication energy and computational delay. The more 
delay allowed per node (larger the value of n), the smaller 
the energy consumed in the network. 

In Figures 6(g) |6(i)| we compare the performance of dif- 
ferent approximation schemes after incorporating an accuracy 
constraint in the formulation for different values of \V\, n and 
n a . In this scenario, we observe that ILP_P3 yields results 
within 5% of the optimal while DAA yields values within 
51% of the optimal. And the advantage of using a better 
centralized algorithm becomes more pronounced as the value 
of n a increases as any sub-optimal local decision in this 
scenario leads to an extra transmission of a FFT (R bits) and 
not just an eigenvector (r bits). 

We next evaluate the performance of DAA on a real sensor 
platform, the Narada sensing unit developed at the University 
of Michigan li3~5"l . This wireless device is powered by an 
Atmel ATmega 128 microprocessor. It is supplemented by 
128 KB of external SRAM and utilizes the 4-channel, 16- 
bit ADS8341 ADC for data acquisition. Narada's wireless 
communication interface consists of Chipcon CC2420 IEEE 
802.15.4 compliant transceiver, which makes it an extremely 
versatile unit for developing large-scale WSNs. This prototype 



10 






(a) 



(b) 



(c) 



Fig. 7. (a) Time to construct the tree vs n; \V\ = 12. (b) Time to construct the tree vs |V|; n = 4. (c) Maximum cluster size (corresponding computational 
delay in seconds) vs n\ \V\ = 12. 



is powered by a constant DC supply voltage between 7 and 9 
volts, and has an operational life expectancy of approximately 
48 hours with 6 AA batteries, given constant communication 
and data analysis demands. 

We use a testbed of 12 Narada sensor nodes deployed in a 
corridor in the Electrical Engineering and Computer Science 
building at the University of Michigan. Each Narada wireless 
sensor is programmed with DAA algorithm, and asked to 
autonomously form computational clusters with varying values 
of n. The root of the tree is randomly selected in each 
experiment. In a manner similar to |33l , the weight of an 
edge e is set to w e = ^—p^^sssi) < where RSSI is the radio 
signal strength indicator reported by the radio and pcf is 
the probability that a communication link with perfect RSSI 
fails due to unforeseen circumstances and is set to 0.1 for the 
Narada platform. 

The objective of our experiment is to study the time it takes 
to construct the data collection tree using DAA, as well as 
the cluster sizes and the corresponding sensing cycles as a 
function of n in a real- world setting. Figures |7(a)| and |7(b)| 
plot the time it takes to construct the tree as a function of 
n and \V\ respectively. We see that this time only depends 
on the size of the network. Figure |7(c)| plots the maximum 
cluster size as well as the maximum computational delay for 
the corresponding cluster size as a function of n. Figure [H] 
shows the data collection trees constructed for n = 3 and 
n = 5, respectively. To summarize, the implementation and 
the experimental results verify the feasibility of DAA in a 
real SHM sensor network. 




n=3 n=5 

Fig. 8. Data collection trees for n = 3 and n = 5. 



VII. Conclusions 

This paper studies the problem of networked computation 
within the context of wireless sensor networks used for 
structural health monitoring. It presents centralized ILPs and 
distributed approximation algorithms to derive optimal com- 
munication structures for the distributed computation of SVD. 
Both simulations and implementations are used to evaluate 
their performance. Our results demonstrate the advantage of 
in-network computation as it significantly reduces the amount 
of data transmitted over the network. 
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Appendix A 
A Brief Overview of Singular Value 
Decomposition 

Let A be a real mxn matrix with m > n. Then, the singular 
value decomposition (SVD) factors A as follows: A = UYV T 
where U T U = V T V = VV T = I„ and £ = diag{oi , ct 2 , • • • , 0- 
The matrix U consists of n orthonormalized eigenvectors as- 
sociated with the n largest eigenvalues of AA T , and the matrix 
V consists of the orthonormalized eigenvectors of A T A. The 
diagonal elements of E are the non-negative square roots of the 
eigenvalues of A T A. We shall assume that Ci > O2 > . . . > (J„. 
SVD comprises of two steps [36]: converting the matrix A into 
a bi-diagonal form, and using a variant of the QR algorithm 
to iteratively diagonalize this bi-diagonal matrix. 

Appendix B 
Proof of TheoremQ] 

First, the decision version of our problem is in NP: Given 
a communication structure, computing the energy consumed 
at each node and checking if the constraints specified in Eqns 
(ISTl-dPfli are satisfied can both be done in polynomial time. 
Hence, testing feasibility and whether the total cost is less 
than a given value M is accomplished in polynomial time. 

Next, to prove NP-hardness we perform a reduction from 
the set cover problem 1371 . whose decision version is defined 
as follows. 

Definition 6: Given a collection C of subsets of a finite set 
P and an integer < K < \C\, with \C\ the cardinality of C, the 
set cover problem asks whether C contain a subset of C' C C 
with \C'\ <K, such that every element of P belongs to at least 
one of the subsets in C' (this is called a set cover of P). 

For any instance of the set cover problem, we now build an 
instance of the decision version of problem PI, which seeks to 
find a communication structure with which the energy cost is 
at most M while satisfying the computational delay constraints 
at each node and the combinability constraint. 

Consider a graph consisting of three layers, as shown in 
Figure [9] a single node Vb at the bottom layer, a (C) layer 
of sets of nodes Q 6 C each with an internal structure as 
shown in Figure |9|b), and a (P) layer of nodes {pj S P}. 
Each element Q £ C in the middle layer contains the |Q| 
nodes in Q plus 3 extra nodes as shown in Figure [9jb): Node 
JC3 connects to the base station Vo with weight. Nodes x\ 
and X2 are connected to X3 with weights 1 and 1 < a < d, 
respectively, and connected with each other with weight d. 
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The other \C k \ nodes are connected to both x\ and X2 with 
weights d > 0. Furthermore, each such structure C k G C are 
connected to the same |Q| nodes in the P layer that belong to 
the set Cjt, via node jci , all with weight <i. Finally, all the X3 
nodes are inter-connected with weight 0. 

|C k | nodes 




(a) A graph instance. (b) Inner structure of each 

subset C k 

Fig. 9. Instance of the problem PI for any given instance of the set cover 
problem. In (b), the solid lines illustrate connectivity internal to the structure 
whereas dashed lines are for external connections. 

The delay constraints are defined as follows: no more than 
|Q| + 1 FFT's can be combined on nodes x\ and x%, and no 
more than 4 FFT's on node x$ for a given C k , and no constraint 
on Vrj or any nodes in the P layer. 

We are now ready to show that finding the solution to the 
decision version of PI on the above network graph with the 
stated delay constraints and a choice of 

M = dR\\P\+Y,\C k \J+R\C\ + aR\C\ 

+ar(|P|+Z) + r£(|Q| + l) , (25) 

k 

for some positive integer K < \C\, is equivalent to finding a 
set cover of cardinality K or less for the set P. 

For d > (R\C\+aR\C\+ar(\P\+K) + r1 k (\C k \ + l))/R, 
the communication structure for PI will have transmissions on 
exactly |P| edges between the layers P and C, and on exactly 
\C k \ edges in the structure shown in Figure |9|b) for every 
C k G C. That means no other node than X\,X2 and X3 will be 
used as a relay, or belong to S. If some other node belongs to S, 
then the cost of the communication structure would contain R 
bits passing through more than \P\ + Y,k (|Q |) edges of weight 
d which would result in a cost larger than M. This also implies 
that x\ and X3 for all C k G C belong to S. The only degree of 
freedom is whether xi lies in S or not. (Recall that X2 G S only 
if a SVD computation takes place on x^ also.) 

The key idea is to show that for 1 < a < d, finding a com- 
munication structure with cost at most M means connecting 
the nodes in layer P to at most K structures of layer C. If 
more than K structures in C is needed, then the cost of the 
communication structure will necessarily be higher than M. 

We first show that if a corresponding Q is not connected 
to any node in the P layer, then its corresponding X2 node 
will not belong to S. This is because in this case the optimal 
communication structure is to have all the other \C k \ nodes 
(other than x\,X2 and X3) send their data to x\ (since a > 
1, transmitting everything to x\ instead of x-i will consume 



less energy) who will then compute the SVD and send the 
corresponding eigenvectors as well as its own FFT to X3. Note 
that node X3 can receive FFT's from xi,X2 and other X3 nodes 
with no extra cost; thus combinability will be trivially ensured. 
It will forward all the computed eigenvectors to Vq. The total 
energy consumed in this operation is 

El=d\C k \R + aR+R + r{\C k \ + \). (26) 

We next show that if a corresponding Q is connected to 
at least one node in the P layer, then its corresponding X2 
will always belong to S. This is because since no more than 
\C k \ + 1 FFT's can be combined on x\, if n k of the nodes in the 
P layer send their FFT to xi, then x\ can combine FFT's from 
no more than \C k \— n k nodes belonging to the structure of C k . 
The remaining n k nodes will have to send their FFT to X2 as it 
has the next smallest distance (after xi) to these nodes. Thus, 
X2 will combine data from n k +\ nodes. The energy consumed 
in this scenario is 

E k =d\C k \R + dn k R + R + aR + ar(n k + 1 ) + r (|Q | + 1) . (27) 

Note that £ Ct n k = P. 

Thus, if the structures in the P layer connected to the C layer 
constitute the set F\ and those unconnected to the C layer the 
set F2, then the total energy consumed is 

e = L E l+L 4 

C k eFi C k EF 2 

= dRnP\+Y,\Ck\J +R\C\ + aR\C\ 

+^(1^1+^+^(1^1 + 1), (28) 

k 

where K 1 = \F\\. The above quantity will be larger than M if 
K' > K. This means that finding a communication structure 
with a cost at most M implies finding a set of K elements or 
less from the C layer to which all the nodes in set P connect. 
In other words, a communication structure with a cost of at 
most M yields a set cover of size at most K. 

Lastly, we need to ensure that the set cover of size at 
most K also yields a communication structure of cost at 
most M. If an element is contained in only one set in the 
set cover, then connect the corresponding node in P to the 
corresponding C k , and if an element belongs to multiple sets 
in the set cover, then choosing one of these sets uniformly 
at random, and connecting the corresponding node P to the 
C k which corresponds to this randomly chosen set, yields 
an energy cost of no more than M (follows obviously from 
the previous discussion). The computational delay constraint 
is also obviously satisfied at all nodes. We merely need to 
ensure that all computations are combinable. Since each node 
X3 belonging to the structure of C k send its FFT to the 
node X3 belonging to the structure of node C^ +1 j mo( j| C |, all 
computations are combinable. 

Thus our decision problem is NP-complete and our opti- 
mization problem is NP-hard. 
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Appendix C 
Parallel Simulated Annealing 

In a structural health monitoring system, a common tech- 
nique to translate raw sensor data into an estimate of damage 
involves comparing system properties in an unknown state 
of health to those in a known, undamaged state l38l . l39l . 
This technique is referred to as model updating and involves 
adjusting the system parameters iteratively in an analytical 
model such that the analytical system produces response 
data that matches results obtained experimentally. Using this 
method, damage can be detected in a system by periodically 
searching for changes in model parameters that can be linked 
directly to suboptimal system performance. 

A wide variety of model updating techniques have been 
developed over the years B51 . One common approach is to 
define an objective function, E, which relates the difference 
between analytical and experimental data. This function can 
be repeatedly evaluated with varying values of the analytical 
model parameters until the difference between the analytical 
and experimental response is minimized. 

Simulated annealing (SA) is one of the most common 
algorithms for stochastically searching for the global minimum 
of such an objective function. This method has been used 
frequently in model-based damage detection techniques [41]. 
Metropolis et al. Il42l developed this algorithm to determine 
the global minimum energy state amidst a nearly infinite 
number of possible configurations. The Metropolis criterion 
expresses the probability of a new system state being accepted 
at a given system temperature, and can be stated as: accept the 
new state if and only if E new < E y — Tin (U), where E is the 
value of the objective function for a given energy state, U is a 
uniformly distributed random variable between and 1, and T 
is the temperature of the system. The addition of the Tin (U) 
term allows the system to accept an invalid state in the hope 
of avoiding premature convergence to a local minima. 

A standard SA algorithm begins the optimization process by 
assigning an initial temperature T\, and letting the Metropolis 
algorithm run for Ni iterations. During each iteration, cer- 
tain analytical model parameters are reassigned in a pseudo- 
random fashion, and the objective difference between the 
experimental and analytical output is determined. This newly 
created state is either accepted or rejected based on the 
Metropolis criterion. After N\ iterations, the temperature of the 
system is reduced to T2 and the process runs for N2 iterations. 
This process continues till the temperature drops to a really 
low temperature, Tm, where very few new states are accepted, 
and the system has, in essence, frozen. To summarize, the 
process runs for M temperature steps, and for Nj, 1 < j <M 
iterations for each temperature step Tj, 

Over the years, many parallel SA techniques have been 
developed and successfully implemented l43l . Zimmerman et 
al. l44l proposed a new parallel SA technique more suited to 
be implemented over a wireless sensing system for structural 
health monitoring as it reduces the communication required 
between processing nodes. This technique breaks up the 
traditionally serial SA tree (which is continuous across all 
temperature steps) into a set of smaller search trees, each of 



which corresponds to a given temperature step and begins with 
the global minimum values for the preceding temperature step. 
Each of these smaller trees can be assigned to a cluster of 
available nodes in the network, and thus can run concurrently. 

As the parallelized search progresses, updated global state 
information has to be disseminated downwards (to the nodes 
doing the computations at lower temperatures) through the 
network. Specifically, when a node detects a new global 
minimum energy state at a given temperature, it communicates 
this information to the cluster-head of its cluster, which then 
propagates this information to all nodes doing the computa- 
tion at lower temperatures. These nodes (computing at lower 
temperatures) will re-start their search based on this new 
state. This may seem wasteful at high temperatures, however, 
as the search algorithm converges on a solution, it becomes 
decreasingly likely that a new global minimum will be found 
at a given temperature step which reduces the total number of 
transmissions. 

P4l explores the advantages of this approach in a wireless 
sensing system. However, it does not explore how to construct 
the communication structure so as to minimize the energy 
consumption which will be the focus of this section. 

We now precisely state the problem. The designer will 
set the values of Nj and kj, 1 < j < M, which denote the 
number of computations to be performed at temperature Tj 
and the number of sensor nodes performing the computation 
at Tj respectively. (Note that Zf=ikj < \V\.) The values of 
Nj and kj will be determined based on the accuracy and the 
computational constraint per node. 

Given the values of Nj and kj, determining the com- 
munication structure involves dividing the V nodes into M 
clusters each of size kj,l < j < M and choosing a cluster- 
head for each cluster. Let the cluster of nodes corresponding 
to temperature Tj be denoted by Kj. (Note that \Kj\ = kj.) 
Finally, let bj £ Kj denote the cluster-head for the cluster Kj. 
Any computation which results in a new minimum energy 
state at a temperature Tj requires exchanging this information 
between all nodes belonging to the cluster Kj, between the 
cluster-heads bj and b[,l > j, and all nodes belonging to 
clusters K/,l > j. Thus, the total number of transmissions for 
each new minimum energy state found at temperature Tj is 
equal to Z veK jH bj ^ v +I$L j+1 H b .^ bl +ZtL j+ i'LveK l H l>i^v' 
where recall that Hi^j,i,j £ V denotes the average number 
of transmissions required to exchange information between 
nodes i and j along the shortest path between the two nodes. 

We first describe an ILP to determine the optimal com- 
munication structure for parallel simulated annealing. Let 
Xij, i £V,l < j < M be an indicator variable which is set to 
1 only if node ; £ Kj. Let yij,i £ V, 1 < j < M be another 
indicator variable which is set to 1 only if node i = bj, 
that is, ; is the cluster-head for Kj. Note that here we have 
a separate variable to denote the cluster-head whereas for 
the SVD computation, we merely set jc„ to 1 if node i 
was a cluster-head. The extra variable is needed for parallel 
simulated annealing to convert the quadratic objective into a 
linear equation. Let t\^j,i,k £V,l < j <M denote an indicator 
variable which is set to 1 only if node i is the cluster-head for 
temperature Tj and node k £ Kj (that is t^j = yijX^j) and let 



14 



Pikj,i,k £ V, 1 < j <M — 1 denote an indicator variable which 
is set to 1 only if node i is the cluster-head at temperature 
Tj and node k is the cluster-head at temperature T; + \ (that 
is Pi*/ = yyyty+i)). Finally, let a h \ < j <M denote the 
probability of generating a new minimum energy state per 
computation at temperature Tj. Then, for Nj computations at 
that temperature, the number of new minimum energy states 
generated are UjNj. Note that generating a new minimum 
energy state triggers new transmissions. 

Following is the ILP to determine the optimal communica- 
tion structure for parallel simulated annealing. 
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Xij,yij,tikj,Pikj e {o,i},i,k ev,i<j<M. 

The first constraint (Equation d30l )) ensures that the cluster 
performing computations at temperature 7} has kj nodes while 
the next two constraints populate the values of t^i = yyXjy and 

Pikj = yijyku+iy 

We finally describe a greedy approximation algorithm to 
determine the communication structure for parallel simulated 
annealing. Recall that we need to determine the set of nodes 
which form a cluster as well the corresponding cluster-head 
for each temperature Tj, 1 < j < M. Figure [10] describes the 
greedy algorithm. We first start from the smallest temperature 
Tm because finding a new energy state at any temperature 
will trigger a transmission between the cluster-head bM and 
the nodes belonging to the cluster Km- Amongst all the nodes 
v G V, determine the cluster-head to be the node which has 
the smallest sum of the average number of transmissions 
required to get to kM nodes. This yields both bM and Km- From 
amongst the remaining nodes, in a similar manner, greedily 
select bM-\ and Km-\ and continue. Assuming the maximum 
height of the unconstrained data collection tree (defined in 
Definition |2} is 0(log(\V\)), using arguments similar to ones 
made in Section IIV-EI the approximation factor of the greedy 
approximation algorithm is also 0{log(\V\)). 



K = V, j = M 

while C/>0) 
minE = °° 

VveJf 



do 



(eiirt) = findMin (K,v,j) 

if [MinE > 4) 

MinE = el , 
= 7-1, K = K\Kj 



Kj = H , 



findMin (K,b,j) 

WeS 

Sort the nodes 



in K in ascending order of d v ' s 



Add the first kj — 1 nodes from this sorted list 
to T 

E = E„er Hv-*b + 1 j<MHb j+i^b 

(Ij<M is an indicator variable which is equal to 
1 if j<M, else it is equal to 0.) 
return (E.T) 



Fig. 10. A greedy approximation algorithm to determine the communication 
structure for parallel simulated annealing. 



